FASTQ files). When the FASTQ files have been downloaded successfully, as shown in
Figure 4.3, we can use “ls fastq” to display the files in the newly created directory.
4.2.1.1.2 Downloading and indexing the reference genome sequence
The reads in the FASTQ files must be aligned to a reference genome of the organism
studied. Therefore, the latest FASTA file of the reference genome sequence is downloaded
from a genome database to a local drive. The NCBI Genome database
“https://www.ncbi.nlm.nih.gov/genome/” is one of the databases that curate reference
genome sequences. We can use the database query box to search for the latest reference
genome of “SARS-CoV-2” and copy the link to the FASTA sequence of the reference genome.
The following bash script creates the “ref” subdirectory, downloads the compressed FASTA
sequence of the latest SARS-CoV-2 reference genome into that subdirectory, and
decompresses the FASTA file. Note that the URL of the reference sequence may change when
a new version becomes available; therefore, visit the reference genome page for the
latest sequence. When you use the following script, make sure that “wget” and the URL
are on the same line and that there is no whitespace in the URL. After downloading the
reference sequence, use “ls” to make sure the file has been downloaded and decompressed.
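# Create the "ref" subdirectory and move into it
mkdir ref
cd ref
# Download the compressed reference FASTA (the URL must stay on one line)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
# The directory now holds a single file; decompress it
f=$(ls *.*)
gzip -d "${f}"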
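The “ls *.*” pattern in the script above works only because the “ref” subdirectory holds a single file at that point. As a minimal sketch, assuming we would rather stop with an error when that is not the case, the decompression step could be wrapped in a small function. The name “decompress_single_gz” is hypothetical and not part of any tool; the function relies only on the standard “gzip -t” (test integrity) and “gzip -d” (decompress) options.

```shell
# Hypothetical helper (illustration only): decompress the single .gz file
# found in a directory and print the path of the decompressed file.
decompress_single_gz() {
    local dir="${1:-.}"
    local count=0
    local gz_file=""
    local f
    # Count only files ending in .gz; the -e test skips the unexpanded
    # pattern when no file matches
    for f in "${dir}"/*.gz; do
        if [ -e "${f}" ]; then
            count=$((count + 1))
            gz_file="${f}"
        fi
    done
    if [ "${count}" -ne 1 ]; then
        echo "Expected exactly one .gz file in ${dir}, found ${count}" >&2
        return 1
    fi
    # Check the integrity of the download before decompressing it
    gzip -t "${gz_file}" || return 1
    gzip -d "${gz_file}" || return 1
    # Report the decompressed file name (the .gz suffix removed)
    echo "${gz_file%.gz}"
}

# Possible use from the project directory, after the download:
#   f=$(decompress_single_gz ref)
```

Because the function prints the decompressed file name, the result can be stored in a variable and passed on to later steps instead of listing the directory again.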
4.2.1.1.3 Indexing the FASTA file of the reference genome
As discussed in Chapter 2, we need to index the FASTA sequence of the reference genome
with both “samtools faidx” and the aligner used for mapping. In this example, we will use
the “bwa” aligner; therefore, we will use “bwa index” for indexing as well. The following
bash script uses “samtools” and “bwa” to index the reference genome:
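# Still inside "ref": pick up the decompressed FASTA file
f=$(ls *.*)
# Build the samtools index (.fai), then the bwa index files
samtools faidx "${f}"
bwa index "${f}"
# Return to the project directory
cd ..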
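As a quick check on the two indexing commands above, the following sketch tests for the files they are expected to write next to the FASTA file: “.fai” from “samtools faidx” and “.amb”, “.ann”, “.bwt”, “.pac”, and “.sa” from “bwa index”. The function name “check_index_files” is hypothetical, and the suffix list assumes the default output names of the two commands.

```shell
# Hypothetical helper (illustration only): verify that the index files
# expected from "samtools faidx" and "bwa index" exist and are not empty.
check_index_files() {
    local fasta="$1"
    local missing=0
    local suffix
    # fai comes from samtools faidx; the other five come from bwa index
    for suffix in fai amb ann bwt pac sa; do
        if [ ! -s "${fasta}.${suffix}" ]; then
            echo "Missing or empty: ${fasta}.${suffix}" >&2
            missing=1
        fi
    done
    # Return 0 only when every expected file was found
    return "${missing}"
}

# Possible use from the project directory:
#   check_index_files ref/GCF_009858895.2_ASM985889v3_genomic.fna
```

The function reports every missing file rather than stopping at the first one, so a single run shows which of the two indexing steps needs to be repeated.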
When you display the content of the “ref” subdirectory, you should see the following files
if the above steps completed successfully:
GCF_009858895.2_ASM985889v3_genomic.fna
GCF_009858895.2_ASM985889v3_genomic.fna.amb
GCF_009858895.2_ASM985889v3_genomic.fna.ann
GCF_009858895.2_ASM985889v3_genomic.fna.bwt